Abstract

In this manuscript, we analyze the sparse signal recovery (compressive sensing) problem
from the perspective of convex optimization by stochastic proximal gradient descent. This
view allows us to significantly simplify the recovery analysis of compressive sensing. More
importantly, it leads to an efficient optimization algorithm for solving the regularized
optimization problem related to the sparse recovery problem. Compared to the existing
approaches, the proposed algorithm has two advantages. First, it enjoys a geometric
convergence rate and is therefore computationally efficient. Second, it guarantees
that the support set of any intermediate solution generated by the proposed algorithm is
concentrated on the support set of the optimal solution.



1. Introduction and Related Work 



The problem of sparse signal recovery is to reconstruct a sparse signal given a number of
linear measurements of the signal. The problem has been studied extensively under two closely
related settings, i.e., lasso and compressive sensing. Lasso is known as a tool of model
selection that aims to learn a sparse model $\beta \in \mathbb{R}^d$ from a data design matrix
$X \in \mathbb{R}^{n \times d}$ and noisy measurements $y = X\beta + \epsilon$ of $\beta$, where
$\epsilon$ are zero-mean independent Gaussian random variables, by solving the $\ell_1$
regularized least squares problem $\min_{\beta \in \mathbb{R}^d} \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$.

Compressive sensing focuses more on the study of how many random measurements are
needed to optimally recover a sparse signal $x_* \in \mathbb{R}^d$. In this manuscript, we provide a
new perspective on compressive sensing from the viewpoint of convex optimization by gradient
descent. Our analysis reveals that, in order to solve the optimal recovery problem
$\min_{x \in \mathbb{R}^d} \frac{1}{2}\|x - x_*\|_2^2$ in hindsight by a gradient descent method, the random
measurements of the signal $x_*$, denoted by $Ux_*$, can be used for computing a stochastic gradient
of the objective. Furthermore, we develop a stochastic gradient descent method that solves a
composite gradient mapping with $\ell_1$ regularization at each iteration, which ensures that the
support set of each intermediate solution concentrates on the support set of the optimal solution.
Finally, we prove that the proposed algorithm enjoys a geometric convergence rate. To the best of
our knowledge, this work is the first to analyze compressive sensing from the angle of
optimization by stochastic gradient descent.
A great volume of work has been devoted to the problem of sparse signal recovery under
different philosophies. In the following, we briefly review some related work that solves the
optimization problem for reconstructing the optimal signal with a linear (i.e., geometric)
convergence rate. In (Bredies and Lorenz, 2008; Hale et al., 2008), the authors established
linear convergence rates once the iterates are close enough to the optimum. Tropp and Gilbert
(2007) showed that if an algorithm can quickly identify the support set of the optimal
solution, then the optimization is effectively reduced to a lower-dimensional subspace, and
geometric convergence can be achieved. Garg and Khandekar (2009) showed a geometric
convergence rate for the solution recovered by sparsification. In (Agarwal et al., 2011),
the authors showed that a simple gradient descent algorithm for the constrained Lasso
can achieve a global geometric convergence rate in recovering the target solution (Corollary
2)^1. One shortcoming of the analysis in (Agarwal et al., 2011) is that the parameter $\kappa$
in the linear convergence rate is lower bounded by a constant (i.e., 3/4) that is independent of
the number of random measurements. This is a disappointing feature, as we expect faster
convergence as the number of random measurements increases.

The proposed approach is similar to several existing algorithms (Wen et al., 2010; Wright
et al., 2009; Hale et al., 2008; Xiao and Zhang, 2012) developed for $\ell_1$ regularized
minimization in that all of them solve the regularized optimization problem by gradually
shrinking the value of the regularization parameter. To the best of our knowledge, (Xiao and
Zhang, 2012) is the only work in this direction that provides a theoretical guarantee. The main
difference between this work and (Xiao and Zhang, 2012) is that our algorithm performs a
simple gradient mapping for each value of the regularization parameter, whereas the algorithm
in (Xiao and Zhang, 2012) requires, at each iteration, solving the $\ell_1$ regularized
optimization problem to a certain accuracy, leading to a significant computational overhead in
optimization.

2. Algorithm 

Let $x_* \in \mathbb{R}^d$ be an $s$-sparse high-dimensional signal to be recovered, where the
number of non-zero elements in $x_*$ is $s$. We denote by $\mathcal{S}(x)$ the support set of $x$,
which includes all the indices of the non-zero entries in $x$, i.e.,

$$\mathcal{S}(x) = \{i \in [d] : [x]_i \neq 0\} \qquad (1)$$

where $[d]$ denotes the set $\{1, \ldots, d\}$ and $[x]_i$ denotes the $i$-th element of $x$. We also
denote by $\bar{\mathcal{S}}(x) = [d] \setminus \mathcal{S}(x)$ the complementary set of
$\mathcal{S}(x)$. In particular, we use $\mathcal{S}_*$ and $\bar{\mathcal{S}}_*$ to denote
the support set and complementary set of $x_*$. Similar to most of the previous analyses, we
assume that $\|x_*\|_2 \le R$.

To motivate our approach, we first consider the following optimization problem

$$\min_{x \in \mathbb{R}^d} \mathcal{L}(x) = \frac{1}{2}\|x - x_*\|_2^2 \qquad (2)$$



1. In the same paper, the authors also discussed a gradient descent algorithm for the regularized Lasso, 
which unfortunately is only able to recover the solution up to the statistical tolerance. 
Evidently, the optimal solution to (2) is $x_*$. We now consider a gradient descent method
for optimizing the problem in (2), leading to the following updating equation for $x_t$

$$x_{t+1} = \arg\min_{x \in \mathbb{R}^d} \|x - (x_t - \nabla\mathcal{L}(x_t))\|_2^2 \qquad (3)$$

where $\nabla\mathcal{L}(x) = x - x_*$. Since the problem in (2) is both smooth and strongly
convex, the above updating rule enjoys a geometric convergence rate^2, allowing an efficient
reconstruction of $x_*$.

However, the updating rule in (3) cannot be used because it requires knowing $x_*$, the full
information of the sparse signal to be recovered. In compressive sensing, the only available
information about the target signal $x_*$ is its random measurements. More specifically, let
$U \in \mathbb{R}^{m \times d}$ be a random measurement matrix and $y = Ux_*$ be the corresponding
$m$ random measurements. Using the random measurements, we construct an approximate gradient as

$$\widehat{\nabla}\mathcal{L}(x_t) = U^\top U(x_t - x_*) = U^\top(Ux_t - y) \qquad (4)$$
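As a quick numerical illustration of (4), the following minimal NumPy sketch forms the
approximate gradient from the measurements alone and compares it with the true gradient on the
support. The dimensions, the $1/\sqrt{m}$ scaling of the Gaussian matrix, the seed, and all
variable names are illustrative assumptions rather than values taken from the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)      # fixed seed (assumption, for reproducibility)
m, d, s = 400, 1000, 5              # illustrative sizes, not taken from the paper

# Gaussian measurement matrix scaled so that E[U^T U] = I (assumed normalization).
U = rng.standard_normal((m, d)) / np.sqrt(m)

x_star = np.zeros(d)
x_star[:s] = 1.0                    # s-sparse target signal
y = U @ x_star                      # the only information available about x_star

x_t = np.zeros(d)
x_t[:s] = 0.5                       # a sparse iterate on the same support

true_grad = x_t - x_star            # requires x_star, so unavailable in practice
approx_grad = U.T @ (U @ x_t - y)   # equation (4): requires only U and y

# Relative deviation between the two gradients, restricted to the support.
support_err = (np.linalg.norm((approx_grad - true_grad)[:s])
               / np.linalg.norm(true_grad))
print(f"relative error on the support: {support_err:.3f}")
```

Off the support the approximate gradient is generally non-zero even though the true gradient
vanishes there; this is the noise that the $\ell_1$ regularization is meant to suppress.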

To ensure that $\widehat{\nabla}\mathcal{L}(x_t)$ provides a useful estimate of
$\nabla\mathcal{L}(x_t)$, we assume that the random measurement matrix $U$ satisfies the
following restricted isometry properties (RIP) with overwhelming probability.

Definition 1 ($s$-restricted isometry constant) Let $\delta_s > 0$ be the smallest constant such
that for any subset $T \subseteq [d]$ with $|T| \le s$ and any $x \in \mathbb{R}^{|T|}$,

$$(1 - \delta_s)\|x\|_2^2 \le \|U_T x\|_2^2 \le (1 + \delta_s)\|x\|_2^2$$

where $U_T$ denotes the sub-matrix of $U$ with columns from $T$.

Definition 2 ($s,s$-restricted orthogonality constant) Let $\theta_{s,s}$ be the smallest constant
such that for any two disjoint subsets $T, T' \subseteq [d]$ with $|T| \le s$, $|T'| \le s$,
$2s \le d$, and for any $x \in \mathbb{R}^{|T|}$, $x' \in \mathbb{R}^{|T'|}$,

$$|\langle U_T x, U_{T'} x' \rangle| \le \theta_{s,s}\|x\|_2\|x'\|_2$$

The above two constants are standard tools in the analysis of optimal recovery in compressive
sensing. It has been shown that several types of random measurement matrices, including
Gaussian, binary, Fourier, and incoherent measurement matrices, satisfy the above RIP with
small $\delta_s$ and $\theta_{s,s}$.
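Computing $\delta_s$ exactly requires examining every column subset of size $s$, which is
infeasible for realistic sizes. The sketch below instead samples random subsets and reports the
largest eigenvalue deviation it finds, which is only a lower bound on $\delta_s$. The function
name `sampled_isometry_constant`, the matrix sizes, and the number of trials are hypothetical
choices made for illustration.

```python
import numpy as np

def sampled_isometry_constant(U, s, trials, rng):
    """Largest deviation from 1 of the eigenvalues of U_T^T U_T over randomly
    sampled column subsets T with |T| = s. This only lower-bounds delta_s,
    because not all subsets are examined."""
    d = U.shape[1]
    worst = 0.0
    for _ in range(trials):
        T = rng.choice(d, size=s, replace=False)
        gram = U[:, T].T @ U[:, T]
        eigs = np.linalg.eigvalsh(gram)          # eigenvalues in ascending order
        worst = max(worst, abs(eigs[0] - 1.0), abs(eigs[-1] - 1.0))
    return worst

rng = np.random.default_rng(0)
m, d, s = 400, 1000, 5                           # illustrative sizes (assumption)
U = rng.standard_normal((m, d)) / np.sqrt(m)     # Gaussian matrix, E[U^T U] = I
estimate = sampled_isometry_constant(U, s, trials=200, rng=rng)
print(f"sampled lower bound on delta_{s}: {estimate:.3f}")
```

A matrix with orthonormal columns, such as the identity, gives a value of zero; for a Gaussian
matrix the value shrinks as the number of measurements $m$ grows relative to $s$.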

Next, we use $\widehat{\nabla}\mathcal{L}(x_t)$ as an approximation of $\nabla\mathcal{L}(x_t)$
and update the solution by performing the following proximal mapping:

$$x_{t+1} = \arg\min_{x \in \mathbb{R}^d} \; \tau_t\|x\|_1 + \langle x - x_t, \widehat{\nabla}\mathcal{L}(x_t) \rangle + \frac{1+\gamma}{2}\|x - x_t\|_2^2 \qquad (5)$$

where $\tau_t > 0$ is the regularization parameter that varies over the iterations and
$\gamma > 0$ is a parameter that arises from the RIP conditions. The updating rule given in (5)
differs from (3) in that (i) the true gradient $\nabla\mathcal{L}(x_t)$ is replaced with the
approximate gradient $\widehat{\nabla}\mathcal{L}(x_t)$ and (ii) an $\ell_1$ regularization term
$\tau_t\|x\|_1$ is added. With an appropriate choice of $\tau_t$, this regularization term
removes the noise arising from the approximate gradient and consequently leads to the
geometric convergence rate.



2. In fact, only one step is needed. 
Algorithm 1 A Composite Optimization Approach for Compressive Sensing

1: Input: Gaussian random matrix $U \in \mathbb{R}^{m \times d}$, random measurements $y = Ux_*$, regularization parameters $\tau_1, \ldots, \tau_T$, and $\gamma$
2: Initialize $x_1 = 0$.
3: for $t = 1, \ldots, T$ do
4:   Compute $\hat{x}_t = x_t - \frac{1}{1+\gamma} U^\top(Ux_t - y)$
5:   Update the solution $x_{t+1} = \mathrm{sign}(\hat{x}_t)\left[|\hat{x}_t| - \frac{\tau_t}{1+\gamma}\right]_+$
6: end for
7: Output the final solution $x_{T+1}$
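Step 5 of Algorithm 1 is an entrywise soft-thresholding operation, which is the exact minimizer
of a scalar quadratic plus an absolute-value penalty. The sketch below checks this numerically
by comparing the closed form against a brute-force grid search; the helper names, the test
vector, and the grid resolution are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, thr):
    """Entrywise sign(v) * max(|v| - thr, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def prox_objective(x, v, thr):
    """Scalar objective 0.5 * (x - v)^2 + thr * |x| minimized by soft-thresholding."""
    return 0.5 * (x - v) ** 2 + thr * np.abs(x)

v = np.array([1.5, -0.3, 0.0, -2.0])    # illustrative input vector
thr = 0.5                               # plays the role of tau_t / (1 + gamma)

closed_form = soft_threshold(v, thr)

# Brute-force minimization over a fine grid, one coordinate at a time.
grid = np.linspace(-3.0, 3.0, 6001)     # grid spacing 0.001
brute_force = np.array(
    [grid[np.argmin(prox_objective(grid, vi, thr))] for vi in v]
)

print(closed_form)
print(brute_force)
```

Entries whose magnitude does not exceed the threshold are set exactly to zero, which is how the
update controls the support of the iterates.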



Remark: We note that our approach is fundamentally different from the classical idea of
stochastic gradient descent. In stochastic gradient descent, we have access to a stochastic
oracle of the gradients. By drawing an unbiased estimate of the gradient independently from
the stochastic oracle at each iteration, stochastic gradient descent is able to reduce the noise
in the stochastic gradients through averaging, by exploiting concentration inequalities
for martingales. In contrast, in compressive sensing, we are only provided with one set of
random measurements for the target signal $x_*$. Since all the estimates of the gradients are
based on the same set of random measurements, they are statistically dependent, making
it impossible to use the martingale technique to reduce the noise in the estimates of the
gradients. The $\ell_1$ regularization term in the updating rule in (5) is introduced
to reduce the noise in the gradient estimates, and therefore plays a role similar to that of
the concentration inequalities for martingales.

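To give the solution of $x_{t+1}$ in closed form, we write (5) as

$$x_{t+1} = \arg\min_{x \in \mathbb{R}^d} \; \frac{1}{2}\left\|x - \left(x_t - \frac{1}{1+\gamma}\widehat{\nabla}\mathcal{L}(x_t)\right)\right\|_2^2 + \frac{\tau_t}{1+\gamma}\|x\|_1 \qquad (6)$$

This is a standard soft-thresholding problem, so the value of $x_{t+1}$ is given by

$$x_{t+1} = \mathrm{sign}(\hat{x}_t)\left[|\hat{x}_t| - \frac{\tau_t}{1+\gamma}\right]_+ \qquad (7)$$

where $\hat{x}_t = x_t - (1/(1+\gamma))\widehat{\nabla}\mathcal{L}(x_t)$, $[v]_+ = \max(0, v)$, and
all operations are applied entrywise. We present the detailed steps of the proposed approach in
Algorithm 1 for reconstructing the sparse signal given a set of random measurements. To end
this section, we present our main result in the following theorem, which states the theoretical
guarantee of Algorithm 1.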

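Theorem 1 Let $x_* \in \mathbb{R}^d$ be an $s$-sparse signal and $y = Ux_*$ be a set of $m$ random
measurements of $x_*$. Set $\gamma, \tau_t$ in Algorithm 1 as

$$\gamma = \max(\delta_{3s}, \theta_{s,s} + \delta_s), \qquad \tau_t = \frac{\theta_{s,s} + \delta_s + \gamma}{\sqrt{s}}(4\gamma)^{(t-1)/2}R, \quad t = 1, \ldots, T.$$

If we assume $\gamma < 1/4$, then for $t = 1, \ldots, T$,
(i) $|\mathcal{S}_t \cup \mathcal{S}_*| \le 2s$,
(ii) $\|x_t - x_*\|_2 \le (4\gamma)^{(t-1)/2}\|x_*\|_2$, and
(iii) $\|x_t - x_*\|_1 \le \sqrt{2s}\,(4\gamma)^{(t-1)/2}\|x_*\|_2$.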
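The following is a minimal NumPy sketch of Algorithm 1 with the geometric schedule for $\tau_t$
from Theorem 1. Because the RIP constants of a given matrix are not available in practice, the
sketch makes two loudly stated assumptions: $\gamma = 0.2$ is picked by hand (it is not verified
to dominate $\delta_{3s}$ and $\theta_{s,s} + \delta_s$ for this matrix), and the numerator
$\theta_{s,s} + \delta_s + \gamma$ is replaced by its upper bound $2\gamma$. The problem sizes,
seed, and function names are likewise illustrative.

```python
import numpy as np

def soft_threshold(v, thr):
    """Entrywise sign(v) * max(|v| - thr, 0), as in equation (7)."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def composite_cs(U, y, taus, gamma):
    """Sketch of Algorithm 1: gradient step with U^T(Ux - y), then soft-thresholding."""
    x = np.zeros(U.shape[1])                              # step 2: x_1 = 0
    for tau in taus:                                      # step 3: t = 1, ..., T
        x_hat = x - U.T @ (U @ x - y) / (1.0 + gamma)     # step 4
        x = soft_threshold(x_hat, tau / (1.0 + gamma))    # step 5
    return x                                              # step 7: x_{T+1}

rng = np.random.default_rng(0)
m, d, s, T = 300, 500, 5, 60                  # illustrative sizes (assumption)
U = rng.standard_normal((m, d)) / np.sqrt(m)  # Gaussian matrix, E[U^T U] = I

x_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
x_star[support] = rng.choice([-1.0, 1.0], size=s)
y = U @ x_star

R = np.linalg.norm(x_star)                    # any R >= ||x_star||_2 is admissible
gamma = 0.2                                   # hand-picked, not derived from RIP constants
# tau_t = (2 * gamma / sqrt(s)) * (4 * gamma)^((t - 1) / 2) * R, with 2 * gamma
# standing in for theta_{s,s} + delta_s + gamma.
taus = [2.0 * gamma / np.sqrt(s) * (4.0 * gamma) ** ((t - 1) / 2.0) * R
        for t in range(1, T + 1)]

x_rec = composite_cs(U, y, taus, gamma)
rel_err = np.linalg.norm(x_rec - x_star) / R
print(f"relative recovery error after {T} iterations: {rel_err:.2e}")
```

Each iteration costs two matrix-vector products with $U$, and the threshold decays by a factor
of $\sqrt{4\gamma}$ per step, mirroring the rate in Theorem 1.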
3. Analysis

Before presenting our analysis, we introduce some notation that will be used throughout
the paper. Given a set $S \subseteq [d]$, we denote by $[x]_S$ the vector that only includes the
entries of $x$ in the subset $S$. Given two subsets $A \subseteq [d]$ and $B \subseteq [d]$, we
denote by $[M]_{A,B}$ the sub-matrix that includes all the entries $(i,j)$ of the matrix $M$ with
$i \in A$ and $j \in B$. We first prove the following theorem.

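Theorem 2 Let $\mathcal{S}_t$ be the support set of $x_t$ and $\mathcal{S}_*$ be the support set of
$x_*$. Define $\mathcal{S}_c = \mathcal{S}_t \cup \mathcal{S}_*$ and
$\mathcal{S}_t^a = \mathcal{S}_c \setminus \mathcal{S}_*$. If we assume
$|\mathcal{S}_t \cup \mathcal{S}_*| \le 2s$, then at most $s$ entries of
$[(1+\gamma)x_t - U^\top U(x_t - x_*)]_{\bar{\mathcal{S}}_*}$ have magnitude larger than
$\frac{\theta_{s,s} + \delta_s + \gamma}{\sqrt{s}}\|x_t - x_*\|_2$.

Proof For any subset $S' \subseteq \bar{\mathcal{S}}_*$ of size $s$, let
$S'_1 = S' \cap \mathcal{S}_t^a$ and $S'_2 = S' \setminus \mathcal{S}_t^a$. We have

$$\begin{aligned}
&\left\|[U^\top U(x_t - x_*)]_{S'} - (1+\gamma)[x_t]_{S'}\right\|_2 \\
&\quad = \left\|U_{S'}^\top U_{\mathcal{S}_*}[x_t - x_*]_{\mathcal{S}_*} + U_{S'}^\top U_{\mathcal{S}_t^a}[x_t]_{\mathcal{S}_t^a} - (1+\gamma)[x_t]_{S'}\right\|_2 \\
&\quad \le \left\|U_{S'}^\top U_{\mathcal{S}_*}\right\|_2 \left\|[x_t - x_*]_{\mathcal{S}_*}\right\|_2 + \left\|U_{S'_2}^\top U_{\mathcal{S}_t^a}\right\|_2 \left\|[x_t]_{\mathcal{S}_t^a}\right\|_2 + \left\|U_{\mathcal{S}_t^a}^\top U_{\mathcal{S}_t^a}[x_t]_{\mathcal{S}_t^a} - (1+\gamma)[x_t]_{\mathcal{S}_t^a}\right\|_2 \\
&\quad \le \theta_{s,s}\left\|[x_t - x_*]_{\mathcal{S}_*}\right\|_2 + \theta_{s,s}\left\|[x_t]_{\mathcal{S}_t^a}\right\|_2 + (\delta_s + \gamma)\left\|[x_t]_{\mathcal{S}_t^a}\right\|_2 \\
&\quad \le (\theta_{s,s} + \delta_s + \gamma)\|x_t - x_*\|_2
\end{aligned}$$

Since the above inequality holds for any subset $S' \subseteq \bar{\mathcal{S}}_*$ of size $s$, we
form the set $S'$ by including the $s$ largest entries in absolute value of
$[(1+\gamma)x_t - U^\top U(x_t - x_*)]_{\bar{\mathcal{S}}_*}$. Because the $\ell_2$ norm of these
$s$ entries is bounded as above, the smallest absolute value in
$[(1+\gamma)x_t - U^\top U(x_t - x_*)]_{S'}$ is bounded by
$\frac{\theta_{s,s} + \delta_s + \gamma}{\sqrt{s}}\|x_t - x_*\|_2$. By the construction of $S'$,
the smallest entry in $S'$ is the $s$-th largest entry in
$[(1+\gamma)x_t - U^\top U(x_t - x_*)]_{\bar{\mathcal{S}}_*}$. We conclude that at most $s$
entries have magnitude larger than
$\frac{\theta_{s,s} + \delta_s + \gamma}{\sqrt{s}}\|x_t - x_*\|_2$. ■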

As an immediate result of Theorem 2, we prove the following Corollary. 



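Corollary 3 Let $\mathcal{S}_t$ be the support set of $x_t$ and $\mathcal{S}_*$ be the support set
of $x_*$. If $|\mathcal{S}_t \cup \mathcal{S}_*| \le 2s$ and
$\tau_t \ge \frac{\theta_{s,s} + \delta_s + \gamma}{\sqrt{s}}\|x_t - x_*\|_2$, then
$|\mathcal{S}_{t+1} \cup \mathcal{S}_*| \le 2s$ and
$|\mathcal{S}_* \cup \mathcal{S}_t \cup \mathcal{S}_{t+1}| \le 3s$.

Proof As shown in (7), $x_{t+1}$ is given by

$$x_{t+1} = \mathrm{sign}(\hat{x}_t)\,\frac{1}{1+\gamma}\left[\left|(1+\gamma)x_t - \widehat{\nabla}\mathcal{L}(x_t)\right| - \tau_t\right]_+$$

By Theorem 2, we know that at most $s$ entries of
$\left[(1+\gamma)x_t - \widehat{\nabla}\mathcal{L}(x_t)\right]_{\bar{\mathcal{S}}_*}$ are larger in
magnitude than $(\theta_{s,s} + \delta_s + \gamma)\|x_t - x_*\|_2/\sqrt{s}$; therefore
$[x_{t+1}]_{\bar{\mathcal{S}}_*}$ has at most $s$ non-zero entries. It follows that
$|\mathcal{S}_{t+1} \cup \mathcal{S}_*| \le 2s$ and
$|\mathcal{S}_* \cup \mathcal{S}_t \cup \mathcal{S}_{t+1}| \le 3s$. ■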
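Theorem 4 If we assume $\|x_t - x_*\|_2^2 \le \Delta_t^2 = (4\gamma)^{t-1}R^2$, set
$\tau_t = \frac{\theta_{s,s} + \delta_s + \gamma}{\sqrt{s}}\Delta_t$ and
$\gamma \ge \max(\delta_{3s}, \theta_{s,s} + \delta_s)$, then we have

$$\|x_{t+1} - x_*\|_2^2 \le \Delta_{t+1}^2 = (4\gamma)^t R^2$$

Proof Let $T = \mathcal{S}_* \cup \mathcal{S}_t \cup \mathcal{S}_{t+1}$. By Corollary 3, we have
$|T| \le 3s$, and therefore $\|U_T^\top U_T - I\|_2 \le \delta_{3s}$. Next, we proceed with the
proof as follows:

$$\begin{aligned}
\mathcal{L}(x_{t+1}) &= \mathcal{L}(x_t) + \langle x_{t+1} - x_t, \nabla\mathcal{L}(x_t) \rangle + \frac{1}{2}\|x_{t+1} - x_t\|_2^2 \\
&= \mathcal{L}(x_t) + \langle x_{t+1} - x_t, \widehat{\nabla}\mathcal{L}(x_t) \rangle + \langle x_{t+1} - x_t, \nabla\mathcal{L}(x_t) - \widehat{\nabla}\mathcal{L}(x_t) \rangle + \frac{1}{2}\|x_{t+1} - x_t\|_2^2 \\
&\le \mathcal{L}(x_t) + \langle x_{t+1} - x_t, \widehat{\nabla}\mathcal{L}(x_t) \rangle + \langle x_{t+1} - x_t, (I - U^\top U)(x_t - x_*) \rangle + \frac{1}{2}\|x_{t+1} - x_t\|_2^2 \\
&\le \mathcal{L}(x_t) + \langle x_{t+1} - x_t, \widehat{\nabla}\mathcal{L}(x_t) \rangle + \|I - U_T^\top U_T\|_2\|x_{t+1} - x_t\|_2\|x_t - x_*\|_2 + \frac{1}{2}\|x_{t+1} - x_t\|_2^2 \\
&\le \mathcal{L}(x_t) + \langle x_{t+1} - x_t, \widehat{\nabla}\mathcal{L}(x_t) \rangle + \delta_{3s}\|x_{t+1} - x_t\|_2\|x_t - x_*\|_2 + \frac{1}{2}\|x_{t+1} - x_t\|_2^2 \\
&\le \mathcal{L}(x_t) + \langle x_{t+1} - x_t, \widehat{\nabla}\mathcal{L}(x_t) \rangle + \frac{1+\gamma}{2}\|x_{t+1} - x_t\|_2^2 + \frac{\delta_{3s}^2}{2\gamma}\|x_t - x_*\|_2^2 \\
&\le \mathcal{L}(x_t) + \langle x_{t+1} - x_t, \widehat{\nabla}\mathcal{L}(x_t) \rangle + \tau_t\|x_{t+1}\|_1 + \frac{1+\gamma}{2}\|x_{t+1} - x_t\|_2^2 + \frac{\delta_{3s}^2}{2\gamma}\|x_t - x_*\|_2^2 - \tau_t\|x_{t+1}\|_1 \\
&\le \mathcal{L}(x_t) + \langle x_* - x_t, \widehat{\nabla}\mathcal{L}(x_t) \rangle + \tau_t\|x_*\|_1 + \frac{1+\gamma}{2}\|x_* - x_t\|_2^2 - \frac{1+\gamma}{2}\|x_{t+1} - x_*\|_2^2 + \frac{\delta_{3s}^2}{2\gamma}\|x_t - x_*\|_2^2 - \tau_t\|x_{t+1}\|_1
\end{aligned}$$

where the last step follows from the optimality of $x_{t+1}$ and the strong convexity of the
objective in (5).

Define

$$\Gamma_t = \frac{\gamma + \delta_{3s}^2/\gamma}{2}\|x_* - x_t\|_2^2 + \tau_t\left(\|x_*\|_1 - \|x_{t+1}\|_1\right) + \langle x_* - x_t, \widehat{\nabla}\mathcal{L}(x_t) - \nabla\mathcal{L}(x_t) \rangle - \frac{1+\gamma}{2}\|x_{t+1} - x_*\|_2^2$$

We have

$$\mathcal{L}(x_{t+1}) \le \mathcal{L}(x_t) + \langle x_* - x_t, \nabla\mathcal{L}(x_t) \rangle + \frac{1}{2}\|x_* - x_t\|_2^2 + \Gamma_t = \mathcal{L}(x_*) + \Gamma_t = \Gamma_t$$

where the last equality follows from $\mathcal{L}(x_*) = 0$. Next, we bound $\Gamma_t$ by

$$\begin{aligned}
\Gamma_t &= \frac{\gamma + \delta_{3s}^2/\gamma}{2}\|x_* - x_t\|_2^2 + \tau_t\left(\|x_*\|_1 - \|x_{t+1}\|_1\right) + \langle x_* - x_t, \widehat{\nabla}\mathcal{L}(x_t) - \nabla\mathcal{L}(x_t) \rangle - \frac{1+\gamma}{2}\|x_{t+1} - x_*\|_2^2 \\
&\le \frac{\gamma + \delta_{3s}^2/\gamma}{2}\Delta_t^2 + \frac{\theta_{s,s} + \delta_s + \gamma}{\sqrt{s}}\Delta_t\sqrt{s}\,\|x_{t+1} - x_*\|_2 + \langle x_* - x_t, (I - U^\top U)(x_* - x_t) \rangle - \frac{1+\gamma}{2}\|x_{t+1} - x_*\|_2^2 \\
&\le \frac{\gamma + \delta_{3s}^2/\gamma}{2}\Delta_t^2 + \frac{(\theta_{s,s} + \delta_s + \gamma)^2}{2}\Delta_t^2 + \frac{1}{2}\|x_{t+1} - x_*\|_2^2 + \delta_{2s}\|x_t - x_*\|_2^2 - \frac{1+\gamma}{2}\|x_{t+1} - x_*\|_2^2
\end{aligned}$$

Since $\mathcal{L}(x_{t+1}) = \|x_{t+1} - x_*\|_2^2/2$, we have

$$\frac{1+\gamma}{2}\|x_{t+1} - x_*\|_2^2 \le \left(\frac{\gamma + \delta_{3s}^2/\gamma}{2} + \frac{(\theta_{s,s} + \delta_s + \gamma)^2}{2} + \delta_{2s}\right)\Delta_t^2$$

leading to

$$\|x_{t+1} - x_*\|_2^2 \le \frac{1}{1+\gamma}\left(\gamma + \frac{\delta_{3s}^2}{\gamma} + 2\delta_{2s} + (\theta_{s,s} + \delta_s + \gamma)^2\right)\Delta_t^2$$

Since $\delta_s$ is non-decreasing in $s$, if we assume
$\gamma \ge \max(\delta_{3s}, \theta_{s,s} + \delta_s)$, we have

$$\|x_{t+1} - x_*\|_2^2 \le \frac{4\gamma + 4\gamma^2}{1+\gamma}\Delta_t^2 \le 4\gamma\Delta_t^2 = \Delta_{t+1}^2$$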
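The arithmetic of the last step can be checked directly. The sketch below evaluates the
coefficient in the final bound and, under the rate $(4\gamma)^{(t-1)/2}R$, the number of
iterations needed to reach a target accuracy. The function names and the sample values of the
constants are hypothetical; they are not RIP constants of any particular matrix.

```python
import math

def contraction_factor(gamma, delta_3s, delta_2s, theta_plus_delta_s):
    """Coefficient c in ||x_{t+1} - x_*||_2^2 <= c * Delta_t^2 from the final bound,
    where theta_plus_delta_s stands for theta_{s,s} + delta_s."""
    return (gamma + delta_3s ** 2 / gamma + 2.0 * delta_2s
            + (theta_plus_delta_s + gamma) ** 2) / (1.0 + gamma)

def iterations_for_accuracy(gamma, R, eps):
    """Smallest t with (4 * gamma)^((t - 1) / 2) * R <= eps, assuming 4 * gamma < 1."""
    return 1 + math.ceil(2.0 * math.log(R / eps) / math.log(1.0 / (4.0 * gamma)))

gamma = 0.2                                    # sample value with gamma < 1/4

# Worst case allowed by the assumptions: every constant equals gamma,
# in which case the coefficient is exactly 4 * gamma.
worst = contraction_factor(gamma, gamma, gamma, gamma)

# Smaller (hypothetical) RIP constants give a strictly smaller coefficient.
better = contraction_factor(gamma, 0.1, 0.1, 0.1)

t_needed = iterations_for_accuracy(gamma, R=1.0, eps=1e-3)
print(worst, better, t_needed)
```

The count grows only logarithmically in $R/\epsilon$, which is what a geometric convergence
rate means in practice.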